Metrics, Statistics, Tests
Abstract
This lecture introduces Information Retrieval (IR) effectiveness metrics and their use in IR experiments based on test collections. Evaluation metrics are important because they are inexpensive tools for monitoring technological advances. The lecture covers a wide variety of IR metrics (except those designed for XML retrieval, as there is a separate lecture dedicated to that topic) and discusses some methods for evaluating evaluation metrics. It also briefly covers computer-based statistical significance testing. The takeaways for IR experimenters are: (1) it is important to understand the properties of IR metrics and to choose or design appropriate ones for the task at hand; (2) computer-based statistical significance tests are simple and useful, although statistical significance does not necessarily imply practical significance, and statistical insignificance does not necessarily imply practical insignificance; and (3) several methods exist for discussing which metrics are "good," although none of them is perfect.
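As a rough illustration of what a computer-based significance test can look like, the sketch below runs a paired randomization (sign-flip) test on per-topic scores of two systems. It is not taken from the lecture: the function name `paired_randomization_test`, the number of trials, the seed, and the sample scores are all assumptions chosen for illustration.

```python
import random


def paired_randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired randomization (sign-flip) test.

    scores_a and scores_b hold per-topic metric values (for example,
    average precision) for two systems on the same topics. Returns an
    estimated p-value for the null hypothesis that the two systems are
    interchangeable. Hypothetical helper, for illustration only.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same topics")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))

    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    hits = 0
    for _ in range(trials):
        # Under the null hypothesis, the sign of each per-topic
        # difference is equally likely to be positive or negative.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped)) >= observed:
            hits += 1

    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (trials + 1)


# Made-up per-topic scores for two hypothetical systems.
system_a = [0.42, 0.55, 0.31, 0.60, 0.48, 0.52, 0.39, 0.45]
system_b = [0.40, 0.50, 0.33, 0.51, 0.47, 0.44, 0.35, 0.41]
p_value = paired_randomization_test(system_a, system_b)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value here says only that the observed difference is unlikely under random sign assignment. As the abstract notes, that does not by itself show the difference matters in practice.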
Similar resources
Evaluating Information Retrieval Metrics Based on Bootstrap Hypothesis Tests
This paper describes how the bootstrap approach to statistics can be applied to the evaluation of IR effectiveness metrics. More specifically, we describe straightforward methods for comparing the discriminative power of IR metrics based on Bootstrap Hypothesis Tests. Unlike the somewhat ad hoc Swap Method proposed by Voorhees and Buckley, our Bootstrap Sensitivity Methods estimate the overall ...
Towards Application-specific Evaluation Metrics
Classifier evaluation has historically been conducted by estimating predictive accuracy via cross-validation tests or similar methods. More recently, ROC analysis has been shown to be a good alternative. However, the characteristics vary greatly between problem domains and it has been shown that some evaluation metrics are more appropriate than others in certain cases. We argue that different p...
Justification of Various Bootstraps, Permutation Tests and Rank Tests via a New Inequality for Quantile Functions
Coupled with convergence of various empirical processes in weighted metrics, a new inequality for quantile functions makes quick work of many limit theorems and allows natural extensions. Research supported in part by NSF grant DMS-8801083 and by the Netherlands Organization for Scientific Research (ZWO). Subject classifications: 60F05.
Using R to Simulate Permutation Distributions for Some Elementary Experimental Designs
Null distributions of permutation tests for two-sample, paired, and block designs are simulated using the R statistical programming language. For each design and type of data, permutation tests are compared with standard normal-theory and nonparametric tests. These examples (often using real data) provide for classroom discussion use of metrics that are appropriate for the data. Simple programs...
Non-Random Sampling and Association Tests on Realized Returns and Risk Proxies
This paper investigates how data requirements can induce a non-random selection of observations from the reference sample to which the researcher wishes to generalize test results. We illustrate the effects of non-random sampling on results of association tests in a setting with data on one variable of interest for all observations, and frequently-missing data on another variable of interest. W...